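Anime Industry Data Analysis¶

This notebook takes an in-depth look at a large-scale anime dataset.
The analysis is offered as an additional feature of the anime recommendation website
developed for the Final Project of the NLP course in the Department of Computer Engineering at Chungbuk National University,
and it gives users trends and interesting facts about the anime industry.


Main Goals¶

The goal is to analyze the dataset and answer the following questions:

  • Which anime are the most popular across genres and themes?
  • Which genres and themes are currently trending in the anime industry?
  • Which anime studios and producers have been successful?
  • How have users rated anime, and what interesting conclusions can be drawn from those ratings?

Analysis Process¶

🔍 Through data visualization and detailed analysis, we:¶

  1. Identify popular anime genres and themes.
  2. Study the direction in which the anime industry is developing.
  3. Uncover the main trends that can make the recommendation system more interesting and useful for users.

Let's get started with the data exploration!

Importing Libraries¶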

In [13]:
!pip install plotly
In [15]:
!pip install wordcloud
In [17]:
!pip install langdetect
In [19]:
# Reading Dataset
import numpy as np
import pandas as pd
# Visualization
import plotly.express as px
import plotly.graph_objects as go  # for 3D plot visualization
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

from wordcloud import WordCloud
from langdetect import detect
from datetime import datetime

Reading Our Dataset¶

In [21]:
# Setting column display to 50
pd.set_option('display.max_columns', 50)
In [24]:
# Importing anime details dataframe
df_anime=pd.read_csv('/Users/giyos/Downloads/archive (1)/anime-dataset-2023.csv')
print("Shape of the Dataset:",df_anime.shape)
df_anime.head(3)
Shape of the Dataset: (24905, 24)
Out[24]:
   anime_id                             Name             English name  \
0         1                     Cowboy Bebop             Cowboy Bebop   
1         5  Cowboy Bebop: Tengoku no Tobira  Cowboy Bebop: The Movie   
2         6                           Trigun                   Trigun   

       Other name  Score                         Genres  \
0       カウボーイビバップ   8.75  Action, Award Winning, Sci-Fi   
1  カウボーイビバップ 天国の扉   8.38                 Action, Sci-Fi   
2           トライガン   8.22      Action, Adventure, Sci-Fi   

                                            Synopsis   Type  Episodes  \
0  Crime is timeless. By the year 2071, humanity ...     TV      26.0   
1  Another day, another bounty—such is the life o...  Movie       1.0   
2  Vash the Stampede is the man with a $$60,000,0...     TV      26.0   

                         Aired    Premiered           Status  \
0  Apr 3, 1998 to Apr 24, 1999  spring 1998  Finished Airing   
1                  Sep 1, 2001      UNKNOWN  Finished Airing   
2  Apr 1, 1998 to Sep 30, 1998  spring 1998  Finished Airing   

                Producers                             Licensors  \
0           Bandai Visual      Funimation, Bandai Entertainment   
1  Sunrise, Bandai Visual           Sony Pictures Entertainment   
2    Victor Entertainment  Funimation, Geneon Entertainment USA   

    Studios    Source       Duration                          Rating  \
0   Sunrise  Original  24 min per ep  R - 17+ (violence & profanity)   
1     Bones  Original    1 hr 55 min  R - 17+ (violence & profanity)   
2  Madhouse     Manga  24 min per ep       PG-13 - Teens 13 or older   

    Rank  Popularity  Favorites  Scored By  Members  \
0   41.0          43      78525   914193.0  1771505   
1  189.0         602       1448   206248.0   360978   
2  328.0         246      15035   356739.0   727252   

                                           Image URL  
0  https://cdn.myanimelist.net/images/anime/4/196...  
1  https://cdn.myanimelist.net/images/anime/1439/...  
2  https://cdn.myanimelist.net/images/anime/7/203...  
In [26]:
# Importing user details dataframe
df_user=pd.read_csv('/Users/giyos/Downloads/archive (1)/users-details-2023.csv')
print("Shape of the Dataset:",df_user.shape)
df_user.head()
Shape of the Dataset: (731290, 16)
Out[26]:
   Mal ID  Username  Gender                   Birthday              Location  \
0       1     Xinil    Male  1985-03-04T00:00:00+00:00            California   
1       3   Aokaado    Male                        NaN          Oslo, Norway   
2       4   Crystal  Female                        NaN  Melbourne, Australia   
3       9    Arcane     NaN                        NaN                   NaN   
4      18       Mad     NaN                        NaN                   NaN   

                      Joined  Days Watched  Mean Score  Watching  Completed  \
0  2004-11-05T00:00:00+00:00         142.3        7.37       1.0      233.0   
1  2004-11-11T00:00:00+00:00          68.6        7.34      23.0      137.0   
2  2004-11-13T00:00:00+00:00         212.8        6.68      16.0      636.0   
3  2004-12-05T00:00:00+00:00          30.0        7.71       5.0       54.0   
4  2005-01-03T00:00:00+00:00          52.0        6.27       1.0      114.0   

   On Hold  Dropped  Plan to Watch  Total Entries  Rewatched  Episodes Watched  
0      8.0     93.0           64.0          399.0       60.0            8458.0  
1     99.0     44.0           40.0          343.0       15.0            4072.0  
2    303.0      0.0           45.0         1000.0       10.0           12781.0  
3      4.0      3.0            0.0           66.0        0.0            1817.0  
4     10.0      5.0           23.0          153.0       42.0            3038.0  
In [28]:
# Importing user score dataframe
df_score=pd.read_csv('/Users/giyos/Downloads/archive (1)/users-score-2023.csv')
print("Shape of the dataset:",df_score.shape)
df_score.head()
Shape of the dataset: (24325191, 5)
Out[28]:
   user_id  Username  anime_id             Anime Title  rating
0        1     Xinil        21               One Piece       9
1        1     Xinil        48             .hack//Sign       7
2        1     Xinil       320                  A Kite       5
3        1     Xinil        49        Aa! Megami-sama!       8
4        1     Xinil       304  Aa! Megami-sama! Movie       8

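Data Analysis¶

Data Exploration¶

Inspecting Each DataFrame¶

To understand the data better, it is important to inspect each DataFrame individually. This includes assessing the structure of the DataFrame and identifying missing values. We start with the info() method, which gives an overview of a DataFrame's columns, non-null counts, and dtypes. Note that info() only counts real nulls: in df_anime, missing values are stored as the string 'UNKNOWN', so they have to be checked separately with value_counts().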
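
The check described above can be sketched as a small helper that reports real NaN values and placeholder strings side by side. This is only an illustration: `missing_summary` and the toy frame are hypothetical names and data, and the `'UNKNOWN'` sentinel is taken from the `value_counts()` output of this notebook.

```python
import pandas as pd


def missing_summary(df, sentinel="UNKNOWN"):
    """Count real NaN values and sentinel strings for every column."""
    nan_count = df.isna().sum()
    # Compare on string form so numeric columns are handled uniformly.
    sentinel_count = df.apply(lambda col: (col.astype(str) == sentinel).sum())
    summary = pd.DataFrame({"nan": nan_count, "sentinel": sentinel_count})
    summary["missing_pct"] = (summary["nan"] + summary["sentinel"]) / len(df) * 100
    return summary


# Toy data standing in for df_anime (values are made up).
toy = pd.DataFrame({
    "Score": ["8.75", "UNKNOWN", "7.10", "UNKNOWN"],
    "Rank": ["41.0", "UNKNOWN", "0.0", "12.0"],
    "Members": [100, 20, 35, 7],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df_anime)` would be the equivalent call; sorting the result by `missing_pct` shows which columns are mostly placeholders.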

In [32]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import numpy as np
from PIL import Image
In [33]:
df_anime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24905 entries, 0 to 24904
Data columns (total 24 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   anime_id      24905 non-null  int64 
 1   Name          24905 non-null  object
 2   English name  24905 non-null  object
 3   Other name    24905 non-null  object
 4   Score         24905 non-null  object
 5   Genres        24905 non-null  object
 6   Synopsis      24905 non-null  object
 7   Type          24905 non-null  object
 8   Episodes      24905 non-null  object
 9   Aired         24905 non-null  object
 10  Premiered     24905 non-null  object
 11  Status        24905 non-null  object
 12  Producers     24905 non-null  object
 13  Licensors     24905 non-null  object
 14  Studios       24905 non-null  object
 15  Source        24905 non-null  object
 16  Duration      24905 non-null  object
 17  Rating        24905 non-null  object
 18  Rank          24905 non-null  object
 19  Popularity    24905 non-null  int64 
 20  Favorites     24905 non-null  int64 
 21  Scored By     24905 non-null  object
 22  Members       24905 non-null  int64 
 23  Image URL     24905 non-null  object
dtypes: int64(4), object(20)
memory usage: 4.6+ MB
In [34]:
# Preprocessing Score column
df_anime['Score'].value_counts()
Out[34]:
Score
UNKNOWN    9213
6.31         80
6.54         80
6.25         79
6.51         79
           ... 
3.21          1
3.29          1
1.85          1
3.69          1
4.07          1
Name: count, Length: 567, dtype: int64
In [39]:
# Mean of the known scores (rows marked 'UNKNOWN' are excluded)
scores = df_anime.loc[df_anime['Score'] != 'UNKNOWN', 'Score'].astype('float')
score_mean = round(scores.mean(), 2)
In [41]:
# Fill unknown scores with the mean, then store the column as float
df_anime['Score'] = df_anime['Score'].replace('UNKNOWN', score_mean)
df_anime['Score'] = df_anime['Score'].astype('float64')
In [43]:
# Preprocessing Rank column
df_anime['Rank'].value_counts()
Out[43]:
Rank
UNKNOWN    4612
0.0         187
6542.0        4
16675.0       4
6577.0        4
           ... 
18424.0       1
18423.0       1
11642.0       1
8977.0        1
14536.0       1
Name: count, Length: 15198, dtype: int64
In [45]:
df_anime['Rank'] = df_anime['Rank'].replace('UNKNOWN', np.nan)
df_anime['Rank'] = df_anime['Rank'].astype('float64')
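
As a possible alternative to handling each column by hand, `pd.to_numeric` with `errors="coerce"` turns any unparsable value, including `'UNKNOWN'`, into NaN in one step. The sketch below runs on made-up values; whether to leave NaN in place or impute the mean is a choice that depends on the plots that follow, since imputing several thousand rows with one value concentrates them at a single score.

```python
import pandas as pd

# Toy stand-in for object-typed columns such as Score or Episodes.
toy = pd.DataFrame({
    "Score": ["8.75", "UNKNOWN", "6.25"],
    "Episodes": ["26.0", "1.0", "UNKNOWN"],
})

for col in ["Score", "Episodes"]:
    # Unparsable strings become NaN instead of raising an error.
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# mean() skips NaN by default, so no imputation is needed to get it.
score_mean = toy["Score"].mean()
print(toy.dtypes)
print(score_mean)
```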
In [47]:
df_user.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731290 entries, 0 to 731289
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Mal ID            731290 non-null  int64  
 1   Username          731289 non-null  object 
 2   Gender            224383 non-null  object 
 3   Birthday          168068 non-null  object 
 4   Location          152805 non-null  object 
 5   Joined            731290 non-null  object 
 6   Days Watched      731282 non-null  float64
 7   Mean Score        731282 non-null  float64
 8   Watching          731282 non-null  float64
 9   Completed         731282 non-null  float64
 10  On Hold           731282 non-null  float64
 11  Dropped           731282 non-null  float64
 12  Plan to Watch     731282 non-null  float64
 13  Total Entries     731282 non-null  float64
 14  Rewatched         731282 non-null  float64
 15  Episodes Watched  731282 non-null  float64
dtypes: float64(10), int64(1), object(5)
memory usage: 89.3+ MB
In [49]:
df_score.isnull().sum()
Out[49]:
user_id          0
Username       232
anime_id         0
Anime Title      0
rating           0
dtype: int64

Data Visualization¶

For Anime Dataset¶

In [53]:
# Count the number of anime titles by type
type_counts = df_anime['Type'].value_counts()

# Create a bar chart
fig = px.bar(type_counts, x=type_counts.index, y=type_counts.values, color=type_counts.index, labels={'x':'Anime Type', 'y':'Count'}, 
             title='Count of Anime Titles by Type')

fig.show()
In [55]:
# Filter out anime titles with popularity value 0
df_valid_popularity = df_anime[df_anime['Popularity'] > 0]

# Sort the dataframe by popularity rank and select the top 15
top_15_popular = df_valid_popularity.sort_values(by='Popularity', ascending=True).head(15)

# Create a bar chart with different colors for each bar
fig = px.bar(top_15_popular, x='Name', y='Popularity',
             labels={'Name': 'Anime Title', 'Popularity': 'Popularity Rank'},
             title='Top 15 Most Popular Animes',
             color='Name')
# Note: a lower Popularity number means a more popular anime.
fig.show()
In [57]:
# Create a scatter plot of score against member count
fig = px.scatter(df_anime, x='Score', y='Members', 
                 labels={'Score':'Overall Score', 'Members':'Number of Members'}, 
                 title='Anime Score vs. Number of Members')

fig.show()
In [59]:
# Sort the dataframe by member count and select the top 15
top_15_scored = df_anime.sort_values(by='Members', ascending=False).head(15)

# Create a bar chart
fig = px.bar(top_15_scored, x='Name', y='Members', labels={'Members':'Number of Users', 'Name':'Anime Title'},color='Name',
             title='Top 15 Animes by Number of Users')

fig.show()
In [61]:
# Split the genres and count their occurrences
genre_counts = df_anime[df_anime['Genres'] != "UNKNOWN"]['Genres'].apply(lambda x: x.split(', ')).explode().value_counts()

# Create a bar chart
fig = px.bar(genre_counts, x=genre_counts.index, y=genre_counts.values,
             labels={'x':'Genre', 'y':'Count'},
             title='Count of Anime Titles by Genre',
             color=genre_counts.index)

fig.show()
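
The same split-and-explode step can also carry the score along, which gives a rough view of how genres compare on rating as well as on title count. This is a sketch on made-up rows, not a result from the dataset; the column names follow df_anime, and `Score` is assumed to be numeric already.

```python
import pandas as pd

# Toy rows standing in for df_anime (values are made up).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Genres": ["Action, Sci-Fi", "Action", "UNKNOWN", "Comedy, Sci-Fi"],
    "Score": [8.0, 6.0, 7.0, 7.0],
})

# One row per (anime, genre) pair, with the UNKNOWN placeholder removed.
genres_long = (
    toy[toy["Genres"] != "UNKNOWN"]
    .assign(Genre=lambda d: d["Genres"].str.split(", "))
    .explode("Genre")
)

# Title count and mean score per genre.
genre_stats = (
    genres_long.groupby("Genre")["Score"]
    .agg(titles="count", mean_score="mean")
    .sort_values("titles", ascending=False)
)
print(genre_stats)
```

If scores were mean-imputed beforehand, the per-genre means are pulled toward the global mean, so it may be worth computing them on known scores only.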
In [63]:
# Select the top 20 genres
top_20_genres = genre_counts.head(20)

# Create a bar chart with custom style
fig = px.bar(top_20_genres, x=top_20_genres.index, y=top_20_genres.values,
             labels={'x':'Genre', 'y':'Count'},
             title='Top 20 Most Popular Genres In The Anime Industry')

# Customize the bar chart appearance
fig.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.8)

# 'titlefont' is deprecated in Plotly; set the axis title font via title=dict(font=...)
fig.update_layout(xaxis_tickangle=-45, xaxis=dict(tickfont=dict(size=12)),
                  yaxis=dict(title=dict(font=dict(size=14))))

fig.show()
In [65]:
# Create the plotly figure
fig = go.Figure(data=[go.Pie(labels=top_20_genres.index, values=top_20_genres.values,
                             hole=0.6, hoverinfo='label+percent', textinfo='value')])

fig.update_layout(title='Distribution of Anime Genres',
                  legend=dict(font=dict(size=12), title='Genre'),
                  annotations=[dict(text='Genre', x=0.5, y=0.5, font_size=20, showarrow=False)])

fig.show()
In [67]:
# Concatenate all genre values into a single string
genre_text = ' '.join(df_anime[df_anime['Genres'] != "UNKNOWN"]['Genres'].dropna())

# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(genre_text)

# Convert the WordCloud object to an image
wordcloud_image = wordcloud.to_image()

# Create a Plotly figure to display the WordCloud image
fig = go.Figure(go.Image(z=wordcloud_image))
fig.update_layout(title='Word Cloud - Genre')
fig.show()
In [69]:
# Create a violin plot for anime popularity by type
fig = px.violin(df_anime, x='Type', y='Popularity', 
                labels={'Type':'Anime Type', 'Popularity':'Popularity'},
                title='Distribution of Anime Popularity by Type',
                color='Type')

fig.show()
In [71]:
# Create a box plot for anime scores by type
fig = px.box(df_anime, x='Type', y='Score', 
             labels={'Type':'Anime Type', 'Score':'Score'},
             title='Distribution of Anime Scores by Type',
             color='Type')

fig.show()
In [73]:
# Create a bubble chart to visualize the relationship between popularity, members, and score
fig = px.scatter(df_anime, x='Popularity', y='Members', size='Score', color='Type',
                 labels={'Popularity':'Popularity', 'Members':'Number of Members'},
                 title='Relationship between Popularity, Number of Members, and Score')

fig.show()
In [75]:
# Create a 3D scatter plot to visualize the relationship between popularity, members, and score
fig = go.Figure(data=go.Scatter3d(
    x=df_anime['Popularity'],
    y=df_anime['Members'],
    z=df_anime['Score'],
    mode='markers',
    marker=dict(
        size=5,
        color=df_anime['Rank'],
        colorscale='Viridis',
        opacity=0.8
    ),
    text=df_anime['Name'],
    hovertemplate='<b>Title</b>: %{text}<br><b>Popularity</b>: %{x}<br><b>Members</b>: %{y}<br><b>Score</b>: %{z}',
))

fig.update_layout(scene=dict(
    xaxis_title='Popularity',
    yaxis_title='Members',
    zaxis_title='Score'
), title='Relationship between Popularity, Members, and Score')

fig.show()
In [77]:
# Create a correlation matrix
correlation_matrix = df_anime[['Score', 'Popularity', 'Rank']].corr()

# Create a heatmap of the correlation matrix
fig = ff.create_annotated_heatmap(z=correlation_matrix.values,
                                  x=list(correlation_matrix.columns),
                                  y=list(correlation_matrix.index),
                                  colorscale='Viridis')
fig.update_layout(title='Correlation Matrix')
fig.show()
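
Popularity and Rank are ordinal positions rather than measurements, so a rank-based coefficient may be a useful companion to the default Pearson matrix. A minimal sketch on made-up values follows; the toy frame is hypothetical, while `DataFrame.corr` and its `method` parameter are standard pandas.

```python
import pandas as pd

# Toy values: score falls as the popularity position number grows.
toy = pd.DataFrame({
    "Score": [9.0, 8.0, 7.0, 5.0],
    "Popularity": [1, 2, 3, 4],
    "Rank": [1.0, 2.0, 3.0, None],  # NaN rows are dropped pairwise
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

Spearman only looks at ordering, so a perfectly monotonic relationship gives -1 or 1 even when it is not linear.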
In [79]:
df_anime['Licensors'].value_counts()
Out[79]:
Licensors
UNKNOWN                                                                  20170
Funimation                                                                 957
Sentai Filmworks                                                           818
Discotek Media                                                             275
Aniplex of America                                                         222
                                                                         ...  
Bandai Entertainment, Maiden Japan                                           1
ADV Films, SoftCel Pictures                                                  1
VIZ Media, Media Blasters, Sentai Filmworks, Geneon Entertainment USA        1
Bandai Entertainment, Discotek Media, NYAV Post, Bandai Visual USA           1
Bandai Namco Online                                                          1
Name: count, Length: 265, dtype: int64
In [162]:
# Create a list of all the individual licensors
licensors_list = [licensor.strip() for licensors in df_anime[df_anime['Licensors']!="UNKNOWN"]['Licensors'].str.split(',') for licensor in licensors]

# Count the occurrences of each licensor
licensor_counts = pd.Series(licensors_list).value_counts()

# Select the top 10 licensors ('UNKNOWN' rows were already excluded above)
top_10_licensors = licensor_counts.head(10)

# Create the bar plot using Plotly
fig = px.bar(top_10_licensors, x=top_10_licensors.index, y=top_10_licensors.values, color=top_10_licensors.index)

# Customize the plot
fig.update_layout(
    title='Top 10 Anime Licensors',
    xaxis_title='Licensors',
    yaxis_title='Count',
    xaxis_tickangle=-45
)

# Show the plot
fig.show()
In [154]:
df_anime['Premiered'].value_counts()
Out[154]:
Premiered
UNKNOWN        19399
spring 2017       88
fall 2016         83
spring 2018       81
spring 2016       78
               ...  
summer 1993        1
summer 1974        1
summer 1991        1
spring 1961        1
summer 2025        1
Name: count, Length: 244, dtype: int64
In [160]:
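# NOTE: the cell that defines season_counts is missing from this export.
# Assumed reconstruction: take the season word from 'Premiered' values such as "spring 1998".
premiered = df_anime.loc[df_anime['Premiered'] != 'UNKNOWN', 'Premiered']
season_counts = premiered.str.split().str[0].value_counts()

# Create the pie plot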
fig = go.Figure(data=go.Pie(
    labels=season_counts.index,
    values=season_counts.values,
    hole=0.4,  # Add a donut hole in the center
    hoverinfo='label+percent',  # Display label and percentage on hover
    textinfo='value',  # Display count value as text inside each slice
    textfont=dict(size=14),  # Set the text font size
    marker=dict(
        colors=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'],  # Custom color palette
        line=dict(color='#ffffff', width=2)  # Set the color and width of the slice borders
    )
))

# Set the title and font style for the plot
fig.update_layout(
    title='Distribution of Premiered Seasons',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)

fig.show()
In [89]:
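# NOTE: premiered_Year is not defined anywhere in this export.
# Assumed reconstruction: take the year from 'Premiered' values such as "spring 1998";
# 'UNKNOWN' cannot be parsed and becomes NaN.
premiered_Year = pd.to_numeric(df_anime['Premiered'].str.split().str[-1], errors='coerce')

# Filter out missing values from premiered_Year
filtered_premiered_year = premiered_Year.dropna().astype(int)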

# Count the occurrences of each year
year_counts = filtered_premiered_year.value_counts()

# Sort the years in ascending order
sorted_years = sorted(year_counts.index)

# Create the bar plot
fig = go.Figure(data=go.Bar(
    x=sorted_years,
    y=year_counts[sorted_years],
    marker=dict(color='#1f77b4'),  # Set the color of the bars
))

# Set the title and axis labels
fig.update_layout(
    title='Number of Animes Premiered by Year',
    xaxis_title='Year',
    yaxis_title='Number of Animes',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)

fig.show()
In [91]:
# Count the occurrences of each studio
studio_counts = df_anime['Studios'].value_counts()

# Filter the studio_counts series to exclude 'Unknown'
studio_counts = studio_counts[studio_counts.index != 'UNKNOWN']

# Select the top 10 studios with the highest number of animes
top_studios = studio_counts.head(10)

# Create the bar plot
fig = go.Figure(data=go.Bar(
    x=top_studios.index,
    y=top_studios.values,
    marker=dict(color=top_studios.values, colorscale='Blues'),  # Set the color of the bars using a colorscale
    text=top_studios.values,  # Label each bar with its count
    hovertemplate='Studio: %{x}<br>Number of Animes: %{y}<extra></extra>',  # Customize the hover template
))

# Set the title and axis labels
fig.update_layout(
    title='Number of Animes by Studio (Top 10)',
    xaxis_title='Studios',
    yaxis_title='Number of Animes',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555'),
    plot_bgcolor='rgba(0, 0, 0, 0)'  # Set the background color to transparent
)

fig.show()
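
The studio count above treats each `Studios` string as one studio. If that column holds comma-separated lists in the same way `Licensors` does (an assumption, not verified here), co-productions would be counted as separate "studios". A sketch of per-studio counting on made-up values:

```python
import pandas as pd

# Toy stand-in for df_anime['Studios'] (values are made up).
studios = pd.Series(
    ["Sunrise", "Madhouse, MAPPA", "UNKNOWN", "Madhouse"], name="Studios"
)

studio_counts = (
    studios[studios != "UNKNOWN"]  # drop the placeholder
    .str.split(",")                # "Madhouse, MAPPA" -> ["Madhouse", " MAPPA"]
    .explode()                     # one row per studio
    .str.strip()                   # remove the space left by the split
    .value_counts()
)
print(studio_counts)
```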
In [93]:
# Count the occurrences of each source
source_counts = df_anime['Source'].value_counts()

# Filter the source_counts series to exclude 'Unknown'
source_counts = source_counts[source_counts.index != 'UNKNOWN']

# Create the horizontal bar chart
fig = go.Figure(data=go.Bar(
    x=source_counts.values,
    y=source_counts.index,
    orientation='h',  # Set the orientation to horizontal
    marker=dict(color=source_counts.values, colorscale='Viridis'),  # Set the color of the bars using a colorscale
    text=source_counts.values,  # Label each bar with its count
    hovertemplate='Source: %{y}<br>Number of Animes: %{x}<extra></extra>',  # Customize the hover template
))

# Set the title and axis labels
fig.update_layout(
    title='Number of Animes by Source',
    xaxis_title='Number of Animes',
    yaxis_title='Source',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)

fig.show()
In [95]:
# Sort the DataFrame by the 'Favorites' column in descending order
sorted_df = df_anime.sort_values('Favorites', ascending=False)

# Select the top 10 most favorited anime
top_favorites = sorted_df.head(10)

# Create the horizontal bar chart
fig = go.Figure(data=go.Bar(
    x=top_favorites['Favorites'],
    y=top_favorites['Name'],
    orientation='h',  # Set the orientation to horizontal
    marker=dict(color='#1f77b4'),  # Set the color of the bars
    text=top_favorites['Favorites'],  # Label each bar with its favorites count
    hovertemplate='Anime: %{y}<br>Favorites: %{x}<extra></extra>',  # Customize the hover template
))

# Set the title and axis labels
fig.update_layout(
    title='Top 10 Most Favorited Anime',
    xaxis_title='Number of Favorites',
    yaxis_title='Anime',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)

fig.show()
In [97]:
# Create a treemap of the same top 10 most favorited anime
fig = go.Figure(go.Treemap(
    labels=top_favorites['Name'],
    parents=[""] * len(top_favorites),
    values=top_favorites['Favorites'],
    hovertemplate='Name: %{label}<br>Favorites: %{value}',
))

# Set the color scale
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
          '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
fig.update_traces(marker=dict(colors=colors))

# Set the title
fig.update_layout(
    title='Top 10 Most Favorited Anime (Treemap)',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555'),
)

fig.show()
In [99]:
# Count the occurrences of each rating, excluding the 'UNKNOWN' placeholder
rating_counts = df_anime[df_anime['Rating']!="UNKNOWN"]['Rating'].value_counts()

# Create the pie plot
fig = go.Figure(data=go.Pie(
    labels=rating_counts.index,
    values=rating_counts.values,
    hoverinfo='label+percent',
    textinfo='value',
    textfont=dict(size=12),
    marker=dict(colors=['#1f77b4']),  # Colors the first slice only; the others use the default colorway
    hole=0.6,  # Set the size of the inner hole to create a donut shape
))

# Set the title
fig.update_layout(
    title='Distribution of Anime Ratings',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555'),
)

fig.show()
In [132]:
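# NOTE: detect_language and map_language_code are not defined anywhere in this export.
# The helpers below are assumed reconstructions built on langdetect.detect.
def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return None  # Detection failed (e.g. no usable characters)

# Hypothetical lookup table; codes not listed here are shown as-is
LANGUAGE_NAMES = {'ja': 'Japanese', 'ko': 'Korean', 'en': 'English',
                  'zh-cn': 'Chinese (Simplified)', 'zh-tw': 'Chinese (Traditional)'}

def map_language_code(code):
    return LANGUAGE_NAMES.get(code, code)

# Apply language detection to the 'Other name' column
Detected_Language = df_anime[df_anime['Other name']!="UNKNOWN"]['Other name'].apply(detect_language)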

# Drop rows where language detection failed (i.e., where Detected_Language is None)
Detected_Language = Detected_Language.dropna()

# Count the occurrences of each language
language_counts = Detected_Language.value_counts()

# Map abbreviated language codes to full names for plotting
language_counts.index = language_counts.index.map(map_language_code)

fig = go.Figure(data=go.Bar(
    x=language_counts.values,
    y=language_counts.index,
    orientation='h',
    marker=dict(color=language_counts.values, colorscale='Viridis'),
    text=language_counts.values,  # Label each bar with its count
    hovertemplate='Native Language: %{y}<br>Number of Animes: %{x}<extra></extra>',
))

# Set the title and axis labels
fig.update_layout(
    title='Count of Animes based on its Native Name',
    xaxis_title='Number of Animes',
    yaxis_title='Native Language',
    title_font=dict(size=20),
    font=dict(size=12, color='#555555')
)

fig.show()

For User Dataset¶

In [105]:
# Distribution of gender
# Count the occurrences of each gender
gender_counts = df_user['Gender'].value_counts(dropna=True)

# Define custom colors for the pie slices
colors = ['rgb(0, 123, 255)', 'rgb(255, 65, 54)', 'rgb(255, 187, 0)', 'rgb(125, 125, 125)']

# Create the pie plot
fig = go.Figure()

fig.add_trace(go.Pie(
    labels=gender_counts.index,
    values=gender_counts.values,
    hole=0.3,
    marker=dict(colors=colors, line=dict(color='#FFFFFF', width=2)),
    hoverinfo='label+percent',
    hovertemplate='<b>%{label}</b><br>%{percent}',
    textinfo='value',
    textposition='inside',
    sort=False
))

# Customize the layout
fig.update_layout(
    title='Gender Distribution',
    title_x=0.5,
    uniformtext_minsize=12,
    uniformtext_mode='hide',
    showlegend=False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    margin=dict(l=20, r=20, t=100, b=20),
)

# Show the plot
fig.show()
In [106]:
df_user['Birthday'].value_counts(dropna=True)
Out[106]:
Birthday
1990-01-01T00:00:00+00:00    177
1989-03-26T00:00:00+00:00    169
1980-01-01T00:00:00+00:00    166
1930-01-01T00:00:00+00:00    153
1991-01-01T00:00:00+00:00    115
                            ... 
1966-12-06T00:00:00+00:00      1
2001-11-08T00:00:00+00:00      1
1954-10-16T00:00:00+00:00      1
1958-03-13T00:00:00+00:00      1
2000-10-13T00:00:00+00:00      1
Name: count, Length: 11247, dtype: int64
In [107]:
from datetime import datetime, timezone
# Age Distribution
# Convert birthday to an approximate age (current year minus birth year)
def calculate_age(birth_date):
    try:
        birth_year = int(str(birth_date).split('-')[0])  # Extract the birth year
    except ValueError:
        return None  # Not a parsable date string (e.g. NaN)
    today_year = datetime.now(timezone.utc).year  # Use timezone-aware UTC
    age = today_year - birth_year
    # Keep only plausible ages (modify the range as needed)
    return age if 10 <= age < 60 else None

# Apply age calculation to the 'Birthday' column
Age = df_user['Birthday'].dropna().apply(calculate_age)

# Create the histogram
import plotly.express as px

fig = px.histogram(Age, nbins=20, title='Age Distribution', labels={'value': 'Age', 'count': 'Count'})

# Customize the layout
fig.update_layout(
    xaxis=dict(title='Age'),
    yaxis=dict(title='Count'),
    bargap=0.1,
    showlegend=False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    margin=dict(l=50, r=20, t=100, b=50),
)

# Show the plot
fig.show()
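
The age used in the histogram is a year difference, so it can be off by one for users whose birthday has not yet occurred in the current year. If exact ages matter, a sketch like the following could be used instead; `age_on` is a hypothetical helper, and the timestamp format is the one shown in the `Birthday` column.

```python
from datetime import date, datetime


def age_on(birthday_iso, today):
    """Age in whole years on `today` for an ISO 8601 timestamp string."""
    born = datetime.fromisoformat(birthday_iso).date()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


print(age_on("1985-03-04T00:00:00+00:00", date(2024, 3, 3)))  # 38
print(age_on("1985-03-04T00:00:00+00:00", date(2024, 3, 4)))  # 39
```

Passing a fixed reference date (for example the dataset's collection date) rather than the current date also keeps the histogram reproducible.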
In [108]:
# Location analysis
# Count the occurrences of each location
location_counts = df_user['Location'].value_counts()

# Create a bar chart
fig = px.bar(location_counts.head(20),
             x=location_counts.head(20).index,
             y=location_counts.head(20).values,
             labels={'x': 'Location', 'y': 'Count'},
             title='Top 20 User Locations',
             color=location_counts.head(20).index)

# Customize the layout
fig.update_layout(
    xaxis=dict(title='Location'),
    yaxis=dict(title='Count'),
    bargap=0.1,
    showlegend=False,
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    margin=dict(l=50, r=20, t=100, b=50),
)

# Show the plot
fig.show()
In [111]:
# Create a correlation matrix
correlation_matrix = df_user[['Days Watched', 'Mean Score', 'Total Entries', 'Rewatched', 'Episodes Watched']].corr()

# Create a heatmap of the correlation matrix
fig = ff.create_annotated_heatmap(z=correlation_matrix.values,
                                  x=list(correlation_matrix.columns),
                                  y=list(correlation_matrix.index),
                                  colorscale='Viridis')
fig.update_layout(title='Correlation Matrix')
fig.show()

For User Score Dataset¶

In [113]:
# Anime watched by the most users in the df_score dataset

# Get the count of users who watched each anime title
anime_watch_count = df_score.groupby('Anime Title')['user_id'].nunique().reset_index()
anime_watch_count = anime_watch_count.rename(columns={'user_id': 'User Count'})

# Sort the dataframe in descending order by the number of users
anime_watch_count = anime_watch_count.sort_values(by='User Count', ascending=False)

# Select the top 10 anime titles with the highest number of users
top_n = 10
top_anime_watch_count = anime_watch_count.head(top_n)

# Define a colorful color palette
color_palette = px.colors.qualitative.Plotly

# Create the bar chart; color by title so the discrete palette is actually applied
fig = px.bar(top_anime_watch_count, x='User Count', y='Anime Title', orientation='h',
             title=f'Top {top_n} Anime Titles Watched by Most Users',
             labels={'User Count': 'Number of Users', 'Anime Title': 'Anime Title'},
             color='Anime Title',
             color_discrete_sequence=color_palette)

# Customize the layout
fig.update_layout(showlegend=False, paper_bgcolor='rgba(0,0,0,0)', plot_bgcolor='rgba(0,0,0,0)',
                  margin=dict(l=50, r=20, t=100, b=50))

# Show the plot
fig.show()
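
Grouping by `Anime Title` merges any entries that happen to share a title. Whether that occurs in this dataset is not checked here, but grouping by `anime_id` and carrying the title along avoids the question. A sketch on made-up rows:

```python
import pandas as pd

# Toy stand-in for df_score; ids 48 and 99 deliberately share a title.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_id": [21, 48, 21, 99, 21],
    "Anime Title": ["One Piece", "Same Title", "One Piece", "Same Title", "One Piece"],
})

watch_count = (
    toy.groupby("anime_id")
    .agg(title=("Anime Title", "first"), users=("user_id", "nunique"))
    .sort_values("users", ascending=False)
)
print(watch_count)
```

Grouping this toy frame by title would report two users for "Same Title", while grouping by id keeps the two entries apart with one user each.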

THANK YOU¶